The video transformer, a deep learning model built on the self-attention mechanism, can efficiently capture and process the spatiotemporal information in videos through joint spatiotemporal modeling, enabling deep analysis and precise understanding of video content; it has therefore become a focal point of academic attention. This paper first reviews the classic model architectures and notable achievements of the transformer in natural language processing (NLP) and image processing. It then explores performance enhancement strategies and video feature learning methods for the video transformer along four key dimensions: input module optimization, internal structure innovation, overall framework design, and hybrid model construction. Finally, it summarizes the latest advances of the video transformer in cutting-edge applications such as video classification, action recognition, video object detection, and video object segmentation, and offers a comprehensive outlook on future research trends and potential challenges as a reference for subsequent studies.
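To make the core mechanism concrete, the following minimal sketch (not drawn from the surveyed models; all names, shapes, and the toy clip size are illustrative assumptions) shows scaled dot-product self-attention applied jointly over flattened space-time tokens, the basic operation underlying the video transformers discussed above.

```python
import numpy as np

def spatiotemporal_self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over flattened space-time tokens.

    x: (n_tokens, d) array, where n_tokens = T * H * W patch embeddings
    of a video clip; every token attends to every other token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token affinities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all space-time positions
    return weights @ v                               # each token aggregates the whole clip

# A toy clip: T=2 frames, each split into a 2x2 grid of patch embeddings (d=8).
rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(2 * 2 * 2, d))             # 8 spatiotemporal tokens
w = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
out = spatiotemporal_self_attention(tokens, *w)
print(out.shape)                                     # (8, 8): one updated embedding per token
```

Because attention here spans all space-time positions jointly, its cost grows quadratically with the number of tokens; the input-module and internal-structure optimizations surveyed in this paper largely aim to reduce that cost.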